CELL Perspectives publication figures

The below Rmarkdown will guide you through the steps that generated the graphs used in the publications figures.

NOTE: all the scripts that are being sourced below, are present inside the figures/R folder in the publication’s GitHub repo.

1. Load/ install all the required packages and custom scripts

source(file = "./R/load_packages.R")

2. Import the Data and the MetaData to be used

source(file = "./R/load_data.R")

3. Pre-process the data

source(file = "./R/preprocess_data.R")

4. Generate figure graphs

4.1 Figure: Spatial Autocorrelation (SA)

4.1.1 Prepare for SA calculations

Here we prepare the data to calculate SA. Note that the below code will produce the selected graphs. If you want to have a look at the exploration before the selection then refer to the figure_SA.R script inside the figures/R folder in the publication’s GitHub repo.

NOTE: during the normalisation using Seurat, we selected the 2000 most variable genes (as per Seurat’s default option).

### Get spot names ----
nb_names <- polygons$Barcode

### Get counts for the 2000 most variable genes ----
### The counts are normalised 
### It is important at the last step to order the rows as the vector of names above

SA_counts_Nor <- obs_Nor %>% 
  filter(rownames(.) %in% var_genes) %>%
  t() %>%
  as.data.frame() %>%
  .[nb_names,]

### Find neighbours ----
neighbours_knn <- knn2nb(knearneigh(polygons$geom_cntd, k = 6), 
                         row.names = nb_names)

### Calculate neighbour weights ----
neighbours_w_exp <- nb2listwdist(neighbours_knn, polygons$geom_cntd,
                                 type = "idw", style = "raw", alpha = 2,
                                 zero.policy = TRUE)

### Set plot output prefix ----
prfx <- "countNorm"

### Set active dataset ----
data <- SA_counts_Nor

4.1.2 Calculate GLOBAL Moran’s I for all genes

After exploring the genes for their global Moran’s I value we identified the below two genes:

ENSG00000197971: as the gene with the highest positive Moran’s I (meaning high positive SA)

4.1.3 Calculate LOCAL Moran’s I

## For the selected gene
SA_genes <- "ENSG00000197971"
### Prepare the data ----
gCounts <- data[,SA_genes]
### Calculate Moran's I ----
names(gCounts) <- rownames(data)
moran.multi.local <- localmoran(gCounts,
                                listw = neighbours_w_exp, 
                                spChk = TRUE)

rm(gCounts)

4.1.4 GLOBAL Moran’s I plots and tables

Moran-plot

Moran Stats table

## a flextable object.
## col_keys: `Test type`, `Moran's I`, `P-value`, `Expected`, `Variance` 
## header has 1 row(s) 
## body has 2 row(s) 
## original dataset sample: 
##   Test type Moran's I  P-value  Expected Variance
## 1   Z-score    0.8006 <0.00001 -0.000277  9.3e-05
## 2   MC perm    0.8006    0.001       N/A      N/A

Normalised gene expression maps

NOTE: the counts are Seurat-normalised and represented as log2-transformed on the map. The transformation took place as log2(x + 1).

## Joining with `by = join_by(Barcode)`

4.1.5 LOCAL Moran’s I plots

The local Moran plots are generated only for the ENSG00000197971 gene with positive SA.

Prepare the data

### Prepare the data ----
quadr <- attr(moran.multi.local, "quadr")
  
moran.local.map <- moran.multi.local %>% 
  as.data.frame() %>% 
  rownames_to_column(var = "Barcode") %>% 
  dplyr::rename("p.value" = `Pr(z != E(Ii))`) %>% 
  left_join(polygons[,"Barcode"]) %>%
  mutate(Quadrant = quadr$pysal,
         colours = Quadrant,
         colours = as.factor(case_when(p.value > 0.05 ~ "#ffffff",
                                       Quadrant == "Low-Low" ~ "#0066ff",
                                       Quadrant == "High-Low" ~ "#ffb3b3",
                                       Quadrant == "Low-High" ~ "#99c2ff",
                                       Quadrant == "High-High" ~ "#ff0000")),
         label = as.factor(case_when(p.value > 0.05 ~ "Not signif.",
                                     Quadrant == "Low-Low" ~ "Low-Low",
                                     Quadrant == "High-Low" ~ "High-Low",
                                     Quadrant == "Low-High" ~ "Low-High",
                                     Quadrant == "High-High" ~ "High-High")),
         signif = as.factor(case_when(p.value < 0.05 ~ "Signif.",
                            p.value >= 0.05 ~ "Not Signif.")),
         b_geom = if_else(signif == "Signif.", 
                          geom_pol, 
                          NA)) 

Local Moran Map

Local Moran clusters

4.1.6 Plot the negative control

Here we randomised the gene expression of gene ENSG00000197971 over space to generate a negative control and showcase the importance of space in spatial data.

Prepare the data

g = "ENSG00000197971"

order <- sample(1:nrow(data), size = nrow(data), replace = FALSE)

N_control <- data %>%
  dplyr::select(all_of(g)) %>%
  rownames_to_column(var = "Barcode") %>% 
  mutate(Barcode = Barcode[order]) %>%
  left_join(polygons[, c("Barcode", "geom_cntd")])

rownames(N_control) <- N_control$Barcode

neighbours_knn_N_control <- knn2nb(knearneigh(N_control$geom_cntd, k = 6), 
                                   row.names = rownames(N_control))

neighbours_N_Control <- nb2listwdist(neighbours_knn_N_control, N_control$geom_cntd,
                                     type = "idw", style = "raw", alpha = 2,
                                     zero.policy = TRUE)

Normalised gene expression map

NOTE: the counts are Seurat-normalised and represented as log2-transformed on the map. The transformation took place as log2(x + 1).

Moran Stats table

## a flextable object.
## col_keys: `Test type`, `Moran's I`, `P-value`, `Expected`, `Variance` 
## header has 1 row(s) 
## body has 2 row(s) 
## original dataset sample: 
##   Test type Moran's I  P-value  Expected Variance
## 1   Z-score     0.002 0.405998 -0.000277  9.3e-05
## 2   MC perm     0.002    0.411       N/A      N/A

Local Moran

## For the selected gene
SA_genes <- "ENSG00000197971"
### Prepare the data ----
gCounts <- N_control[,SA_genes]
### Calculate Moran's I ----
names(gCounts) <- N_control$Barcode
moran.multi.local <- localmoran(gCounts,
                                listw = neighbours_N_Control, 
                                spChk = TRUE)

rm(gCounts)

4.2 Figure: Modifiable Areal Unit Problem (MAUP)

4.2.1 Prepare for MAUP calculations

Here we prepare the data to calculate MAUP. Note that the below code will produce the selected graphs. If you want to have a look at the exploration before the selection then refer to the figure_MAUP.R script inside the figures/R folder in the publication’s GitHub repo.

NOTE 1: here we selected to work with Seurat-Normalised counts.
NOTE 2: we filtered out genes that were expressed in less than 1000 spots.

### Get counts to be used ----
select <- SA_counts_Nor %>% 
  t() %>%
  as.data.frame() %>%
  filter(rowSums(. != 0) > 1000) %>% 
  rownames(.)
  
cor.input <- SA_counts_Nor %>%
  dplyr::select(all_of(select))

### Set plot output prefix ----
prfx <- "countNorm"

### Get a set of colours ----
## Colourblind-friendly colours using cols4all
## c4a palette: c4a.bu_br_bivs
cols_grid <- c("#4992C8", "#AECBEA")

## c4a palette: misc.okabe
cols_lay <- c("#E69F00", "#56B4E9", "#009E73",
              "#F0E442", "#0072B2", "#D55E00", 
              "#CC79A7")

4.2.2 Calculate correlations

To showcase the change in the correlation statistic as aggregation changes we selected genes that had a low positive Spearman R initially which flipped to a negative Spearman R at the “layer” level of aggregation. This was selected in accordance with a reviewer comment which argued that a flip from a positive to a negative trend makes the MAUP effect illustrated better. More specifically the selection pool consisted of genes that fulfilled all 3 of the below filters:

Aggregation level Filter
No aggregation R > 0
Grid-like aggregation R > 0.1
Biological layers R < -0.7

A genes pair was selected in random to produce the visualisation:

gene_X gene_Y R
ENSG00000110484 ENSG00000111341 0.18
## Spearman correlations between all genes
cor.multi <- cor(cor.input, cor.input, method = "spearman")
diag(cor.multi) <- 0 #set diagonal to zero

gene_x <- "ENSG00000110484"
gene_y <- "ENSG00000111341"

4.2.3 Create zonation maps

We showcase the MAUP in two different zone types.
1. A grid.
2. The tissue regions.

For the grid we need a 361-groups grid. This is so that we reduce the number of observations by one order of magnitude from the 3610 individual spots we have initially in the data set. To find how many breaks we should have on each side of the tissue we calculate: sqrt(361) = 19 so we need roughly 19 groups on the X and 19 on the Y axis. The X has a length of 115 and the Y of 76. Therefore, 115/19 = 6.05 (so we need to break every 6 spots) and 76/19 = 4 (so we need to break every 4 spots).

NOTE: because the tissue is not a perfect square the groups are eventually going to be less than 361.

### Prepare the data ----
## Get the range of the locations
rangeX <- range(meta_D[meta_D[, "Section"] == 1, "Spot_X"])
rangeY <- range(meta_D[meta_D[, "Section"] == 1, "Spot_Y"])

## Store the information in columns denoting groups.

dt_MAUP <- polygons %>% 
  left_join(meta_D[,c("Barcode", "Spot_X", "Spot_Y")], by = "Barcode") %>%
  mutate(group_X = as.numeric(cut(.data[["Spot_X"]], 
                                  breaks = seq(rangeX[1], rangeX[2], by = 6), 
                                  include.lowest = TRUE)),
         group_Y = as.numeric(cut(.data[["Spot_Y"]], 
                                  breaks = seq(rangeY[1], rangeY[2], by = 4), 
                                  include.lowest = TRUE)),
         group_361 = factor(paste(group_X, group_Y, sep = "_")))

## The grid-like grouping will have a two-colour scheme
cols <- rep(cols_grid, 181)

colours <- expand.grid(group_X = seq(1:19),
                       group_Y = seq(1:19)) %>%
  mutate(group_361 = factor(paste(group_X, group_Y, sep = "_"))) %>%
  mutate(cols = cols[1:nrow(.)]) %>%
  dplyr::select(cols, group_361)

dt_MAUP <- dt_MAUP %>%
  left_join(colours)

4.2.4 Calculate correlations for zones

As a summary statistic we selected the mean expression because the mean is not affected by the size of each area. On the other hand, the sum is affected by how big is the area (If a gene is almost uniformly expressed throughout, then: number of spots in area goes UP –> gene expression goes UP –> correlation goes UP).

### Prepare the data ----
## Add zones to the correlation input df
cor.input <- cor.input %>%
  rownames_to_column(var = "Barcode") %>%
  left_join(dt_MAUP[,c("Barcode", "group_361", "layer")]) %>%
  st_drop_geometry() %>% 
  dplyr::select(matches("[^geom_pol]")) %>%
  column_to_rownames(var = "Barcode") 

## Generate the grouped correlation inputs
groups <- c("group_361", "layer")

for (g in groups) {
  cIn <- cor.input %>%                                     # get the correlation input
    dplyr::select(matches(paste0("ENSG|", g))) %>%         # select one group type
    group_by_at(vars(g)) %>%                               # group by group type and gene
    summarise(across(matches("ENSG"), tibble::lst(mean)))  # summarise per group per column
  
  out <- cIn %>%
    dplyr::select(ends_with("_mean")) %>% 
    ungroup() %>%
    select_all(~gsub("_mean", "", .))
  
  assign(paste0("cor.input_", g, "_mean"), out)
  
  rm(g, cIn, out)
}
## Working on: cor.input_group_361_mean
## Working on: cor.input_layer_mean
## Working on: cor.input_group_361_mean
## Working on: cor.input_layer_mean

4.2.5 Empty tissue map.

This map will be used for the correlations.

4.3 Figure: Spatial Heterogeneity (SH)

4.3.1 Prepare for SH calculations

Here we prepare the data to calculate SH Note that the below code will produce the selected graphs. If you want to have a look at the exploration before the selection then refer to the figure_SH.R script inside the figures/R folder in the publication’s GitHub repo.

NOTE 1: The list of 2000 genes from Seurat comes after variance stabilising and selecting the most variable genes. We want to narrow this further down and get a pool of genes that are variable and expressed in more than 1000 locations on the tissue.

NOTE 2: After filtering the pool of genes was reduced to 435 genes.

### Get Seurat-normalised data ----
data <- SA_counts_Nor

### Select a set of genes to test ----
## The list of 2000 genes from Seurat comes after variance stabilising and 
## selecting the most variable genes. We want to narrow this further down and 
## get a pool of genes that are variable and expressed in more than 2/3 of the 
## tissue area.
#### Filter for number of locations expressed ----
SH_genes <-  data %>%
  t() %>%
  as.data.frame() %>%
  filter(rowSums(. != 0) > 1000) %>% 
  rownames(.)

### Add geometries and convert to sp format ----
dt_SH <- data %>%
  dplyr::select(all_of(SH_genes)) %>%
  rownames_to_column(var = "Barcode") %>%            # barcodes from row names to column
  left_join(polygons[,c("geom_pol", "Barcode")]) %>% # merge with geometries
  column_to_rownames(var = "Barcode") %>%            # barcodes back to row names
  st_as_sf(., sf_column_name = "geom_pol") %>%       # data frame to sf
  as(., "Spatial")                                   # sf to sp - GWmodel still works with sp objects
## Joining with `by = join_by(Barcode)`
### Set plot output prefix ----
prfx <- "countNorm"

## Housekeeping
rm(data)

4.3.2 Run multiple linear/ global regressions

Here we run a linear regression model for all possible combinations for the 435 selected genes, excluding self-pairs.

After running the regressions, we sorted the pairs based on their R2 and selected 26 pairs with R2 > 0.4.

These 26 pairs were fed to the GWR model.

4.3.3 Run GWR

From the initial 26 pairs, the one below was selected to showcase SH.

  • ENSG00000091513~ENSG00000197971
### Determine the kernel bandwidth ----
bw <- bw.gwr("ENSG00000091513~ENSG00000197971",
             approach = "AIC",
             adaptive = T,
             data = dt_SH)
## Take a cup of tea and have a break, it will take a few minutes.
##           -----A kind suggestion from GWmodel development group
## Adaptive bandwidth (number of nearest neighbours): 2238 AICc value: 8302.722 
## Adaptive bandwidth (number of nearest neighbours): 1391 AICc value: 8202.582 
## Adaptive bandwidth (number of nearest neighbours): 866 AICc value: 8139.5 
## Adaptive bandwidth (number of nearest neighbours): 543 AICc value: 8073.116 
## Adaptive bandwidth (number of nearest neighbours): 342 AICc value: 8026.095 
## Adaptive bandwidth (number of nearest neighbours): 219 AICc value: 8005.941 
## Adaptive bandwidth (number of nearest neighbours): 141 AICc value: 8010.028 
## Adaptive bandwidth (number of nearest neighbours): 265 AICc value: 8011.159 
## Adaptive bandwidth (number of nearest neighbours): 188 AICc value: 8003.862 
## Adaptive bandwidth (number of nearest neighbours): 171 AICc value: 8005.641 
## Adaptive bandwidth (number of nearest neighbours): 200 AICc value: 8004.689 
## Adaptive bandwidth (number of nearest neighbours): 181 AICc value: 8003.765 
## Adaptive bandwidth (number of nearest neighbours): 176 AICc value: 8004.691 
## Adaptive bandwidth (number of nearest neighbours): 183 AICc value: 8003.338 
## Adaptive bandwidth (number of nearest neighbours): 185 AICc value: 8003.593 
## Adaptive bandwidth (number of nearest neighbours): 182 AICc value: 8003.637 
## Adaptive bandwidth (number of nearest neighbours): 184 AICc value: 8003.728 
## Adaptive bandwidth (number of nearest neighbours): 182 AICc value: 8003.637 
## Adaptive bandwidth (number of nearest neighbours): 183 AICc value: 8003.338
## The bandwidth is an adaptive bandwidth (kernel size) indicating that its size 
## varies but the nearest 183 observations will be used to compute each local 
## weighted regression. 
## Here it, has a value of 183 indicating that 5% of the tissue data will be 
## used in each local calculation (there are 3610 records in the data):
message("Bandwidth (bw): ", bw)
## Bandwidth (bw): 183
gwr <- gwr.basic("ENSG00000091513~ENSG00000197971", 
                 adaptive = T,
                 data = dt_SH,
                 bw = bw)

gwr
##    ***********************************************************************
##    *                       Package   GWmodel                             *
##    ***********************************************************************
##    Program starts at: 2023-08-04 13:35:36 
##    Call:
##    gwr.basic(formula = "ENSG00000091513~ENSG00000197971", data = dt_SH, 
##     bw = bw, adaptive = T)
## 
##    Dependent (y) variable:  NA
##    Independent variables:  
##    Number of data points: 3610
##    ***********************************************************************
##    *                    Results of Global Regression                     *
##    ***********************************************************************
## 
##    Call:
##     lm(formula = formula, data = data)
## 
##    Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8116 -0.6743  0.0350  0.6262  4.5189 
## 
##    Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
##    (Intercept)     -1.01147    0.03884  -26.04   <2e-16 ***
##    ENSG00000197971  0.65023    0.01161   56.00   <2e-16 ***
## 
##    ---Significance stars
##    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
##    Residual standard error: 0.8197 on 3608 degrees of freedom
##    Multiple R-squared: 0.465
##    Adjusted R-squared: 0.4648 
##    F-statistic:  3136 on 1 and 3608 DF,  p-value: < 2.2e-16 
##    ***Extra Diagnostic information
##    Residual sum of squares: 2424.145
##    Sigma(hat): 0.8196832
##    AIC:  8813.13
##    AICc:  8813.137
##    BIC:  5246.279
##    ***********************************************************************
##    *          Results of Geographically Weighted Regression              *
##    ***********************************************************************
## 
##    *********************Model calibration information*********************
##    Kernel function: bisquare 
##    Adaptive bandwidth: 183 (number of nearest neighbours)
##    Regression points: the same locations as observations are used.
##    Distance metric: Euclidean distance metric is used.
## 
##    ****************Summary of GWR coefficient estimates:******************
##                         Min.   1st Qu.    Median   3rd Qu.   Max.
##    Intercept       -2.450513  0.020129  0.256662  0.617590 7.8805
##    ENSG00000197971 -0.947430  0.013307  0.106291  0.260641 0.9704
##    ************************Diagnostic information*************************
##    Number of data points: 3610 
##    Effective number of parameters (2trace(S) - trace(S'S)): 144.4093 
##    Effective degrees of freedom (n-2trace(S) + trace(S'S)): 3465.591 
##    AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): 8003.338 
##    AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): 7890.709 
##    BIC (GWR book, Fotheringham, et al. 2002,GWR p. 61, eq. 2.34): 5030.39 
##    Residual sum of squares: 1827.128 
##    R-square value:  0.5967473 
##    Adjusted R-square value:  0.5799391 
## 
##    ***********************************************************************
##    Program stops at: 2023-08-04 13:35:37
gwr.tab <- apply(gwr$SDF@data[, 1:7], 2, summary)
gwr.tab <- round(gwr.tab, 1)
t(gwr.tab[,1:2])
##                 Min. 1st Qu. Median Mean 3rd Qu. Max.
## Intercept       -2.5       0    0.3  0.5     0.6  7.9
## ENSG00000197971 -0.9       0    0.1  0.2     0.3  1.0
## Housekeeping
rm(gwr.tab)